align with grpc base/balancer to trigger reconnect in Idle state (when we move to gRPC 1.41+) #335
Conversation
Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.11.0 to 1.11.1.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.11.0...v1.11.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Some questions below. I knew this stuff pretty well at one point, but it's mostly left my head and I need to be hand-held through it a bit more. Thanks! And sorry for the delay
sc.Connect()
}
d.mu.Unlock()
return
Why is it important to return here, given that we don't return with the Connect() call below?
As far as I can tell, the purpose is to not call UpdateState below. Is that true, and why would it be a problem to do so?
(I'm trying to figure out if this code block could be eliminated, and instead have the check below be the only place where Connect() is called.)
Also, the core problem (as I understand it) is that Idle sub-connections won't actually re-connect unless sc.Connect() is explicitly called, right?
Was this a change in gRPC behavior? I'm confused how this isn't a problem we've experienced ourselves. For that matter, help me understand: how does a previously-active SubConn come to be Idle again (my very old recollection is that they bounce between failure and Connecting)?
In any case, since the check below will cause Connect to be called, is there a particular reason that we don't want to update connState[sc] if it's state.Idle now?
Put another way, would the root issue be resolved with just the addition below of:
if state.ConnectivityState == connectivity.Idle {
	sc.Connect()
}
Almost certainly I'm missing something 🤷.
No problem. I am going to take another look at the changes that occurred in gRPC over time, as it has been updated in our master a couple of times, to see if it plays any part. Agreed, it is strange that this has not popped up in your deployment.
A lot of my investigation was following the evolution (via blame) of https://github.com/grpc/grpc-go/blob/master/balancer/base/balancer.go#L180, as we appeared to line up pretty closely with the overall pattern of UpdateSubConnState(), but they have been tweaking it over time.
I will get back with hopefully good answers to your questions and your simplification may be valid.
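For reference, the Idle handling in grpc-go's base balancer that the dispatcher change mirrors looks roughly like the sketch below. This is a simplified paraphrase, not the upstream code verbatim, and the exact logic varies across releases; the stand-in type `baseLikeBalancer` exists only to make the snippet self-contained.

```go
package sketch

import (
	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/connectivity"
)

// baseLikeBalancer is a pared-down stand-in used only to show the shape of
// the Idle handling in grpc-go's balancer/base; it is not the upstream code.
type baseLikeBalancer struct {
	cc       balancer.ClientConn
	scStates map[balancer.SubConn]connectivity.State
}

func (b *baseLikeBalancer) UpdateSubConnState(sc balancer.SubConn, state balancer.SubConnState) {
	s := state.ConnectivityState
	oldS, ok := b.scStates[sc]
	if !ok {
		// Unknown (or already removed) SubConn: ignore the update.
		return
	}
	if oldS == connectivity.TransientFailure &&
		(s == connectivity.Connecting || s == connectivity.Idle) {
		// Early return: kick an Idle SubConn back into connecting without
		// regenerating the picker or calling cc.UpdateState.
		if s == connectivity.Idle {
			sc.Connect()
		}
		return
	}
	b.scStates[sc] = s
	switch s {
	case connectivity.Idle:
		// An Idle SubConn will not re-dial on its own; Connect() is required.
		sc.Connect()
	case connectivity.Shutdown:
		delete(b.scStates, sc)
	}
	// ...the upstream balancer then regenerates its picker and pushes the
	// aggregated state via b.cc.UpdateState(...).
}
```

The parts relevant to this PR are the `case connectivity.Idle: sc.Connect()` arm and the early return in the top check, which is the return behavior discussed above.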
Still looking, but I think this gRPC PR grpc/grpc-go@03268c8#diff-f9772420643575b997f617c6d4c1934aaa26f057042c7fa71a521ef5bb2af253 is the one where gRPC introduced the transition to the Idle state and adjusted their own example load balancer to work with it.
In their example balancer/base/balancer.go they return in the top check for the Idle state and bypass the call to UpdateState(), so I followed that. I can dig deeper to see why that may or may not make a difference.
Looking deeper, I think the change above may be in a later version of gRPC than the gazette repo is using (gazette's go.mod shows v1.40.0). Our application mono-repo uses gRPC in a lot of services and our go.mod is at v1.52.0. That may explain why you do not see this issue and we do (in our consumers). Let me confirm the exact version in which the commit above shows up.
It seems like the gRPC change above was for the 1.41.0 release, which the gazette repo does not yet use. They spell out the behavior changes here: https://github.com/grpc/grpc-go/releases/tag/v1.41.0. Even though the commit I link to is not explicitly called out, it follows the balancer #4613 PR they mention.
So my PR may be premature for wider release.
Will talk to Michael when he comes back next week to see if we should close this PR to master.
@jgraettinger I guess these (or similar) changes will be required when gRPC is upgraded to 1.41+. For now I changed the title to reflect this. I will leave this open for now, but will close it if you think that is best.
…om/prometheus/client_golang-1.11.1' into broker-restart-reconnect
Just an FYI: only the dispatcher.go changes relate to this PR. We somewhat polluted this PR by merging open PR #331 (go.mod, go.sum) into our fork (as we required it in our product as well). It should have been done on a new branch in our fork.
Thanks, sorry for the long delay here. We've got a further PR that builds upon this PR to update gRPC to 1.59, which we've been testing. We're planning to land both this PR and that one on Monday.
Thanks for the heads up.
When a k8s rolling restart occurs (e.g. `kubectl rollout restart deployment gazette`) we see the brokers' IP addresses change, and the consumers need to be restarted to re-establish their GRPC connections. Without a restart of the consumers, their default service SubConns wait forever in the connectivity Idle state on their open Read GRPCs.

This change adds extra Idle state handling to the dispatcher's `UpdateSubConnState` method to align with the GRPC base balancer's implementation (https://github.com/grpc/grpc-go/blob/master/balancer/base/balancer.go#L180), which tries to connect when a SubConn is in the Idle state.

The end result of this change is that connectivity is re-established from the consumers to the brokers (after the brokers have restarted), but these new connections often go via the default service SubConn and so do not honor the intra-zone routing preferences (as the `Pick` method is called without a Route), as outlined below. The changes do not appear to affect the existing setup of SubConns and improve robustness under restarts (though it is not perfect from a zone perspective).
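For readers following along, a minimal sketch of the shape of this dispatcher change is below. The type and field names (`dispatcherSketch`, `connState`, `cc`) approximate the snippets discussed in the review and are assumptions; this is not the exact diff, and the real dispatcher tracks considerably more state (routes, marks, the default service SubConn).

```go
package sketch

import (
	"sync"

	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/connectivity"
)

// dispatcherSketch is a simplified stand-in for the dispatcher; the real one
// also tracks routes, marks, and the default service SubConn.
type dispatcherSketch struct {
	cc        balancer.ClientConn
	mu        sync.Mutex
	connState map[balancer.SubConn]connectivity.State
}

func (d *dispatcherSketch) UpdateSubConnState(sc balancer.SubConn, state balancer.SubConnState) {
	d.mu.Lock()
	if state.ConnectivityState == connectivity.Idle {
		// Idle SubConns never re-dial on their own. Without this explicit
		// Connect(), open Read RPCs sit Idle forever after a broker
		// rolling restart.
		sc.Connect()
		d.mu.Unlock()
		// Return without updating connState or the picker, following the
		// early-return pattern in grpc-go's base balancer.
		return
	}
	d.connState[sc] = state.ConnectivityState
	d.mu.Unlock()
	// ...the existing dispatcher logic then prunes Shutdown SubConns and
	// pushes a refreshed picker via d.cc.UpdateState(...).
}
```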
Prior to the change:

After initial system bringup or a restart of the consumers, the GRPC connections follow the expected intra-zone routing (in the case below the consumer is in zone `us-central1-b`):

Perform a `kubectl rollout restart deployment gazette` and the open Read GRPCs remain in the Idle state. Sometimes a few (~25% in my case) of the SubConns do successfully navigate the transition. Those that do were found to have received a `NOT_JOURNAL_BROKER` response in their `broker/client/reader.go` `Read` method, which made them pass a full Route to the `Pick` routine (not shown below, as in this case all Read GRPCs went Idle). This non-empty Route is often composed of both new and old broker information, similar to #215, but potentially that was a temporary situation at the moment the `Pick` was triggered.

After the change:
(Ignore the extra DJD traces that dump the dispatcher structures.)
Again, at initial system bringup or a restart of the consumers, the GRPC connections follow the expected intra-zone routing (in the case below the consumer is in zone `us-central1-c`):

Perform a `kubectl rollout restart deployment gazette` and the open Read GRPCs reattach to the brokers.

A few of them that went through the `NOT_JOURNAL_BROKER` handling were given a non-empty route and reattach directly to a broker pod. Note that in the case below the first Pick route entry (gazette-756b95d7d4-mq8kf) is actually in the process of being replaced, but is not chosen due to its zone. I think a GRPC can cross zones if the new broker for that zone is not present in the list at the time of the `Pick` call.

Most of the open Read GRPCs actually have the `Pick` method called with no route and connect via the default service SubConn (10.44.x.x address), which does not trigger any intra-zone rule handling.
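To illustrate that last point, here is a hypothetical zone-aware picker built on grpc-go's `balancer.Picker` interface. The names `zonePicker`, `routeZones`, `byZone`, and `defaultSC` are illustrative stand-ins, not gazette's actual dispatcher API; the sketch only shows why a Pick with no attached route has nothing to filter by zone and falls back to the default service SubConn.

```go
package sketch

import (
	"context"

	"google.golang.org/grpc/balancer"
)

// zonePicker is a hypothetical picker, not gazette's real one.
type zonePicker struct {
	localZone string
	defaultSC balancer.SubConn            // default service SubConn (the 10.44.x.x address above)
	byZone    map[string]balancer.SubConn // one SubConn per known broker zone, simplified
}

// routeZones is a placeholder for however the client attaches a Route to the
// RPC context; gazette's real mechanism differs.
func routeZones(ctx context.Context) ([]string, bool) { return nil, false }

func (p *zonePicker) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
	zones, ok := routeZones(info.Ctx)
	if !ok || len(zones) == 0 {
		// No route was attached: intra-zone preferences cannot apply, so the
		// pick goes to the default service SubConn.
		return balancer.PickResult{SubConn: p.defaultSC}, nil
	}
	// A route is present: prefer a member in the local zone.
	for _, z := range zones {
		if z == p.localZone {
			if sc, found := p.byZone[z]; found {
				return balancer.PickResult{SubConn: sc}, nil
			}
		}
	}
	// No same-zone member in the route (e.g. its replacement is not listed
	// yet): cross zones rather than fail the pick.
	if sc, found := p.byZone[zones[0]]; found {
		return balancer.PickResult{SubConn: sc}, nil
	}
	return balancer.PickResult{SubConn: p.defaultSC}, nil
}
```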